NASA Contractor Report 191531 
ICASE Report No. 93-64 





ICASE 


Years of 
Excellence 


MULTIPHASE COMPLETE EXCHANGE: 
A THEORETICAL ANALYSIS 


Shahid H. Bokhari 



NASA Contract Nos. NAS1-19480 and NAS1-18605
August 1993 

Institute for Computer Applications in Science and Engineering

NASA Langley Research Center

Hampton, Virginia 23681-0001


Operated by the Universities Space Research Association 



National Aeronautics and 
Space Administration 

Langley Research Center 

Hampton, Virginia 23681-0001 






Multiphase Complete Exchange: 
A Theoretical Analysis* 

Shahid H. Bokhari 

Department of Electrical Engineering 
University of Engineering & Technology, Lahore, Pakistan


Abstract 

Complete Exchange requires each of N processors to send a unique 
message to each of the remaining N — 1 processors. For a circuit 
switched hypercube with $N = 2^d$ processors, the Direct and Standard
algorithms for Complete Exchange are optimal for very large and very 
small message sizes, respectively. For intermediate sizes, a hybrid 
Multiphase algorithm is better. This carries out Direct exchanges on 
a set of subcubes whose dimensions are a partition of the integer d . 
The best such algorithm for a given message size m could hitherto 
only be found by enumerating all partitions of d. 

The Multiphase algorithm is analyzed assuming a high perfor- 
mance communication network. It is proved that only algorithms cor- 
responding to equipartitions of d (partitions in which the maximum 
and minimum elements differ by at most 1) can possibly be optimal. 
The run times of these algorithms plotted against m form a hull of 
optimality. It is proved that, although there is an exponential number 
of partitions, (1) the number of faces on this hull is $\Theta(\sqrt{d})$, (2) the
hull can be found in $\Theta(\sqrt{d})$ time, and (3) once it has been found, the
optimal algorithm for any given $m$ can be found in $\Theta(\log d)$ time.

These results provide a very fast technique for minimizing com- 
munication overhead in many important applications, such as matrix 
transpose, Fast Fourier transform and ADI. 


*Research supported by the National Aeronautics and Space Administration under 
NASA contracts NAS1-19480 and NAS1-18605 while the author was in residence at the 
Institute for Computer Applications in Science & Engineering, Mail Stop 132C, NASA
Langley Research Center, Hampton, VA 23681-0001. 

1 Introduction


On a distributed memory parallel computer, the complete exchange or all- 
to-all personalized communication pattern requires each of N processors to 
send a unique m-byte message to each of the remaining N — 1 processors. 
This pattern arises in many important algorithms, such as matrix transpose, 
vector-matrix multiply, Fast Fourier transforms, etc. It is also of importance 
in its own right since it is the densest communication requirement that can 
be imposed on an interconnection network. The time required to carry out 
the complete exchange is, thus, a useful measure of the power of a parallel 
computer system. Finally, in many applications that require a dense com- 
munication pattern that is a subset of the complete exchange, it is usually 
beneficial to use a highly tuned complete exchange routine rather than attempting
to write specific code for the required communication.

On circuit switched hypercubes, such as the Intel iPSC-860 and the 
nCUBE-2, there are two basic algorithms for obtaining the complete exchange.
For a hypercube with $N = 2^d$ processors, the Standard exchange
algorithm attempts to minimize the impact of startup time of a message by
combining several messages into one ‘super’ message and using only $d = \log N$
message transmissions [11]. After each transmission, a shuffle step serves to
route messages towards their correct destinations. This algorithm suffers 
from substantial overhead of data permutation. 

The Direct algorithm uses $N - 1$ carefully scheduled ‘direct’ transmissions,
relying on knowledge of the routing algorithm used by the hardware to avoid 
message contention[14, 16, 17]. This algorithm has no data permutation 
overhead but suffers from N - 1 message startups. It is demonstrable that 
the Standard exchange algorithm is best for very small message sizes, while 
the Direct algorithm requires minimum time for very large messages[3]. 

Multiphase complete exchange is a hybrid algorithm that combines the 
features of the Standard exchange and Direct algorithms. It carries out the 
complete exchange as a series of ‘partial’ exchanges on a set of subcubes[2, 
4, 9, 10]. It permits a compromise between the message transmission and 
permutation overhead of Standard exchange and the message startups of the 
Direct algorithm. 

The multiphase algorithm has been implemented and shown to be useful on
the iPSC-2 and iPSC-860 hypercubes. For a given hypercube dimension $d$,
the number of possible multiphase algorithms equals the number of partitions
of the integer $d$. This is an exponential (though slowly growing) number
and hitherto the only way to find the best multiphase algorithm for a given
message size was to enumerate all these partitions.

In this paper we carry out a detailed analysis of the hull of optimality 
of all such multiphase algorithms. We make the assumption that the time 
to transmit a message from one processor to another is independent of the 
number of communication links traversed. This assumption is valid for most 
high-performance circuit-switched machines. 

Our analysis reveals that only algorithms corresponding to equipartitions 
of d (partitions in which the largest and smallest elements differ by at most 
1) can ever be optimal. Furthermore, the number of potentially optimal 
algorithms is always between $2\sqrt{d} - 1$ and $3\sqrt{d}$. We show that the hull of
optimality can be found in $\Theta(\sqrt{d})$ time. Once the hull has been obtained,
the optimal algorithm for a specific value of message size $m$ can be found in
$\Theta(\log d)$ time.

This result provides a very fast method of finding the optimal algorithm 
for a given message size and thus helps in reducing the communication overhead
in a variety of important parallel applications. The $\Theta(\log d)$ time for
finding the optimal algorithm is so fast that it may well be feasible to choose 
the algorithm during the course of program execution, based on the dimen- 
sion of the hypercube and the size of the message currently being transmitted. 

In Section 2 of this paper we discuss the complete exchange communica- 
tion pattern and present the three algorithms. Section 3 contains our main 
analysis in which we present our notation, properties of equipartitions, main 
theorems, and obtain bounds on the number of faces on the hull. We con- 
clude with a discussion of the ramifications of our results and suggestions for 
future research directions. 


2 The Complete Exchange 

Complete Exchange requires each of N processors of a parallel machine to 
send a different message to each of the remaining N — 1 processors. This 
pattern arises, for example, when transposing a matrix of N x N blocks that 
has been distributed over N processors, with one column per processor. The 
transpose requires each processor to send a different block to each of the 
remaining processors. The resulting communication pattern is equivalent to 
the complete directed graph of N nodes.

The matrix mapping described above is required when using the Alternat- 
ing Directions Implicit (ADI) method for solving partial differential equations 
[6, 13]. This method requires access to the matrix by rows and columns in 
successive phases, necessitating heavy use of a transpose. Matrix-matrix and 
matrix-vector multiplies have similar requirements. Complete exchanges are
also required in many implementations of the parallel FFT. 

The complete exchange, being equivalent to the complete directed graph, 
is the densest communication requirement that can be imposed on a network. 
The time required by the complete exchange is an upper bound on the time 
required by any other pattern and thus provides a useful measure of the 
power of a distributed memory parallel system. 

2.1 Standard Exchange 

The Standard exchange algorithm was presented by Johnsson & Ho [11] and
uses $\log N$ transmissions of size $N/2$ blocks each. All communications are
over single links, therefore no attention needs to be paid to the routing algorithm
(in effect, the algorithm does the routing itself). The overheads in
this algorithm are due to shuffling and the long message sizes that need to
be transmitted. Despite this, the algorithm is competitive for small block
sizes, since the total number of messages it transmits is $\log N$ as opposed to
$N - 1$ for the Direct algorithm.

procedure Standard_Exchange;
{ n: number of processors (and of blocks held by each processor), n = 2^d }
begin
    for j = d - 1 downto 0 do
    begin
        if (bit j of mynumber = 0) then
            message = blocks n/2 to n - 1
        else
            message = blocks 0 to n/2 - 1;
        send_message_to_processor(mynumber ⊕ 2^j);
        shuffle blocks;
    end;
end;

2.2 Direct Algorithm

The Direct algorithm was first reported (in Japanese) by Take [17] and later
by Seidel et al. [14, 16]. In this algorithm each processor sends out $N - 1$
messages, one to each of the remaining processors. The issue is to schedule 
the transmissions such that no edge contention takes place. Assuming the 
almost universal ‘e-cube’ routing algorithm, the exclusive-OR schedule de- 
scribed below achieves contention-free transmission. This algorithm always 
outperforms Standard Exchange for large message sizes. 

procedure Direct;
{ n: number of processors, n = 2^d }
begin
    for i = 1 to n - 1 do
        send_block_to_processor(mynumber ⊕ i);
end;
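
The contention-free claim for the exclusive-OR schedule can be checked by brute force on small cubes. The following Python sketch is an illustration rather than code from the report: the function names are hypothetical, and it assumes that e-cube routing corrects differing address bits from the lowest to the highest and that each directed link can carry one message per step.

```python
def direct_schedule(d):
    """Yield, for each step i = 1 .. 2^d - 1, the (source, destination) pairs."""
    n = 1 << d
    for i in range(1, n):
        yield [(p, p ^ i) for p in range(n)]


def ecube_path(src, dst, d):
    """Directed links used by e-cube routing (assumed order: low bit first)."""
    links, cur = [], src
    for bit in range(d):
        if ((cur ^ dst) >> bit) & 1:
            nxt = cur ^ (1 << bit)
            links.append((cur, nxt))
            cur = nxt
    return links


def is_contention_free(d):
    """True if no directed link is used twice within any step of the schedule."""
    for step in direct_schedule(d):
        used = set()
        for src, dst in step:
            for link in ecube_path(src, dst, d):
                if link in used:
                    return False
                used.add(link)
    return True
```

Calling `is_contention_free(4)` walks all 15 steps of a 16-node cube; in every step each node is also the destination of exactly one message.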


2.3 Multiphase Complete Exchange 

The multiphase algorithm combines the Standard exchange and the Direct 
algorithms into one unified algorithm. It carries out the complete exchange 
as a sequence of two or more ‘partial’ exchanges. This algorithm has been 
implemented on the iPSC-2 and iPSC-860 [2, 4, 9, 10]. A complete exchange 
on a hypercube of dimension $d$ with $n = 2^d$ processors and block size $m$ is
done using a set of partial exchanges $\mathcal{D} = \{d_1, d_2, \ldots, d_k\}$ on $k$ subcubes,
where each $d_i$ specifies the dimension of the $i$th subcube. Obviously $|\mathcal{D}| = k$,
$1 \le k \le d$, and $\sum_{i=1}^{k} d_i = d$. Each partial exchange is called a phase.

The $j$th partial exchange is done on the set of subcubes determined by bits
$\sum_{i=1}^{j} d_i - d_j$ to $\sum_{i=1}^{j} d_i$ of the hypercube node labels. In the partial exchange
for the $i$th phase, $2^{d-d_i}$ blocks of $m$ bytes each are transmitted to each of
$2^{d_i} - 1$ processors. The effective block size is thus $m2^{d-d_i}$.

procedure Multiphase;
{ d:     dimension of the hypercube
  n:     number of phases (subcubes) in partition D
  d_i:   dimension of the ith subcube in partition D
  start: starting bit of subcube label
  stop:  ending bit of subcube label }
begin
    start = d - 1;
    for i = 1 to n do
    { Partial exchange }
    begin
        stop = start - d_i + 1;
        compute effective blocksize;
        for j = 1 to (2^(start - stop + 1) - 1) do
            send_effective_block_to_processor(mynumber ⊕ (j * 2^stop));
        shuffle blocks d_i times;
        start = stop - 1;
    end;
end;

In the above algorithm, when the number of phases $k$ (called $n$ in the procedure)
equals $d$, all $d_i$s are 1. In this case the outer $i$
loop is executed $k$ times with $start = stop = d - 1, d - 2, \ldots, 1, 0$. The inner
$j$ loop is executed only once for each $i$. In this case Multiphase degenerates
into Standard exchange. When $k = 1$ and therefore $d_1 = d$, the outer loop is
executed only once, $stop$ always equals 0 and, in the inner loop, $j$ takes on
the values $1, 2, \ldots, 2^d - 1$ and thus Multiphase becomes Direct.
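
The two degenerate cases can be seen by listing the partners that one node contacts in each phase. This is a minimal Python sketch of the loop structure of the Multiphase procedure only (no data movement or shuffling); `multiphase_partners` is a hypothetical name introduced for illustration.

```python
def multiphase_partners(me, partition):
    """Partners contacted by node `me`, one list per phase.

    `partition` lists the subcube dimensions d_1, d_2, ... in phase order;
    the hypercube dimension is their sum.
    """
    d = sum(partition)
    start = d - 1
    phases = []
    for d_i in partition:
        stop = start - d_i + 1
        # j runs over 1 .. 2^(start - stop + 1) - 1, shifted to bit `stop`
        phases.append([me ^ (j << stop) for j in range(1, 1 << d_i)])
        start = stop - 1
    return phases
```

With `partition = [1, 1, 1]` node 0 contacts one neighbour per phase (the Standard exchange pattern), while `partition = [3]` yields a single phase with all seven other nodes (the Direct pattern).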

In our analysis, we have assumed that the complete exchange corresponds 
exactly to a transpose. Thus not only do blocks have to be transmitted among 
processors but each block needs to be placed in memory in the destination 
processor in its ‘correct’ transposed position. This accounts for the shuffle 
at the end of the last partial exchange. When there is only one phase, i.e. 
the algorithm corresponds to Direct exchange, the last set of d shuffles is 
equivalent to the identity permutation and is redundant. In the interest of 
simplicity, this has not been excluded from our analysis. 

2.4 Implementation 

A detailed evaluation of the performance of the Standard Exchange and Di- 
rect algorithms appears in [3]. The multiphase algorithm has been evaluated 
in [2, 4], wherein it has been shown that this approach can improve perfor- 
mance by as much as a factor of 2. 

3 Analysis of the Multiphase Algorithm

The performance parameters characterizing a typical hypercube architecture
are given in Table 1. $\tau$ is the time to transmit one byte while $\rho$ is the time to
move a byte from one memory location to another, on the same processor.
$\lambda$ is the startup time, the time that elapses from issuance of a transmit
request to initiation of transmission of the first byte. $\delta$ is the distance impact,
that is the time required for a message to travel across the communication
network of the processor. We assume this to be independent of the number
of communication links traversed.

We omit the overhead of processor synchronization from our analysis. 
Each phase of our algorithm takes a precise amount of time. If all processors 
keep their clocks synchronized, there is no need for a global synchronization 
operation between phases, as the time to start a new phase can be com- 
puted by each processor independently. The issue of clock synchronization 
on hypercubes is discussed in [7]. 


Table 1: Performance parameters of a hypercube

Symbol      Description          Units
$\tau$      transmission         time per byte
$\rho$      data permutation     time per byte
$\lambda$   startup (latency)    time per message
$\delta$    distance impact      time per message


The time taken for a message of size $m$ bytes is $\tau m + \lambda + \delta$. The Standard
Exchange algorithm requires $d$ transmissions of $m2^{d-1}$ bytes each, and $d$
shuffles on $2^d$ blocks of $m$ bytes. This leads to

$$t_{\text{standard}} = d(\tau m 2^{d-1} + \lambda + \delta) + d\rho 2^d m = d\left[\left(\frac{\tau}{2} + \rho\right) 2^d m + (\lambda + \delta)\right].$$

The Direct algorithm needs $2^d - 1$ transmissions of $m$ bytes each, giving
us

$$t_{\text{direct}} = (2^d - 1)\left[m\tau + (\lambda + \delta)\right].$$

A Multiphase algorithm, with $n$ phases of size $d_i$ each, requires for the $i$th
phase $2^{d_i} - 1$ transmissions of $m2^{d-d_i}$ bytes each, followed by a permutation
of $m2^d$ bytes. Thus

$$t_{d,d_i} = (2^{d_i} - 1)(\tau m 2^{d-d_i} + \lambda + \delta) + \rho m 2^d = \left[(1 - 2^{-d_i})\tau + \rho\right] 2^d m + (2^{d_i} - 1)(\lambda + \delta). \qquad (1)$$

Since $\sum_{i=1}^{n} d_i = d$, the total time required by the Multiphase algorithm is

$$t_{\text{multiphase}} = \sum_{i=1}^{n} t_{d,d_i} = \sum_{i=1}^{n} \left\{ \left[(1 - 2^{-d_i})\tau + \rho\right] 2^d m + (2^{d_i} - 1)(\lambda + \delta) \right\}. \qquad (2)$$
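
The three cost expressions translate directly into code. The following Python sketch is an illustration of the model only; the function and argument names (`tau`, `rho`, `lam`, `delta` for the four parameters of Table 1) are choices made here, not names from the report.

```python
def t_standard(d, m, tau, rho, lam, delta):
    """Standard exchange: d transmissions of m*2^(d-1) bytes plus d shuffles."""
    return d * (tau * m * 2 ** (d - 1) + lam + delta) + d * rho * 2 ** d * m


def t_direct(d, m, tau, rho, lam, delta):
    """Direct algorithm: 2^d - 1 transmissions of m bytes, no shuffle."""
    return (2 ** d - 1) * (m * tau + lam + delta)


def t_phase(d, d_i, m, tau, rho, lam, delta):
    """One partial exchange on subcubes of dimension d_i, equation (1)."""
    return (2 ** d_i - 1) * (tau * m * 2 ** (d - d_i) + lam + delta) + rho * m * 2 ** d


def t_multiphase(partition, m, tau, rho, lam, delta):
    """Total time for a partition of d, equation (2)."""
    d = sum(partition)
    return sum(t_phase(d, d_i, m, tau, rho, lam, delta) for d_i in partition)
```

For the all-ones partition `t_multiphase` reproduces `t_standard` exactly. For the single-phase partition it exceeds `t_direct` by `rho * m * 2**d`, the redundant final shuffle that the analysis deliberately keeps.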


3.1 Finding the best Multiphase algorithm 

In our presentation of the multiphase algorithm, we have not stated which 
of the many possible partitions of the integer d is best in terms of total time. 
The total number of partitions of the integer $d$ is approximated by [1, 8]

$$p(d) \approx \frac{1}{4\sqrt{3}\,d}\, e^{\pi\sqrt{2/3}\,\sqrt{d}}$$

which is a slowly growing exponential, with $p(20) = 627$. It is feasible, though
neither efficient nor elegant, to enumerate all partitions of $d$ to find the best
algorithm using the expression for $t_{\text{multiphase}}$ (2). Furthermore, $t_{\text{multiphase}}$ is
not convex for $n = 2$. It is therefore not possible to find the best partition
by recursively halving $d$.
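
The exhaustive search that the analysis aims to replace is easy to state. The sketch below, in Python, enumerates all partitions of $d$ and picks the cheapest under equation (2); the names `partitions` and `best_partition` are hypothetical, and the parameter values used in the usage note are those quoted later for Figure 1.

```python
def partitions(d, smallest=1):
    """Yield every partition of d as a non-decreasing list of positive integers."""
    if d == 0:
        yield []
        return
    for first in range(smallest, d + 1):
        for rest in partitions(d - first, first):
            yield [first] + rest


def t_multiphase(partition, m, tau, rho, lam, delta):
    """Total time from equation (2)."""
    d = sum(partition)
    return sum(
        (2 ** a - 1) * (tau * m * 2 ** (d - a) + lam + delta) + rho * m * 2 ** d
        for a in partition
    )


def best_partition(d, m, tau, rho, lam, delta):
    """Brute force: try every partition of d and keep the fastest."""
    return min(partitions(d), key=lambda p: t_multiphase(p, m, tau, rho, lam, delta))
```

With $\tau = 2$, $\rho = 1$, $\lambda = 100$, $\delta = 10$ and $d = 4$, the search returns `[1, 1, 1, 1]` for `m = 1`, `[2, 2]` for `m = 10` and `[4]` for `m = 100`. The number of candidates examined is $p(d)$, which is 627 for $d = 20$.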

The objective of this paper is to carry out a detailed investigation of the 
multiphase algorithm. We shall be concerned with the hull of optimality 
formed by the straight lines that describe the run times of all possible multi- 
phase algorithms on a hypercube of dimension d plotted against the message 
size m. We shall show that a large class of algorithms can never be optimal. 
Of the remaining algorithms, a large fraction are optimal only at vertices of 
the hull of optimality and can be ignored. These results permit us to obtain 
a bound of $\Theta(\sqrt{d})$ on the number of optimal algorithms.

3.2 Notation


Let $[n_1, a_1][n_2, a_2][n_3, a_3] \cdots$ denote the sequence made up of $n_1$ $a_1$’s, followed
by $n_2$ $a_2$’s, etc. Thus $[3, 2][4, 3][2, 5]$ denotes the sequence $\{222333355\}$. The
elements of a sequence shall always be enumerated in non-decreasing order.

Let the calligraphic letter $\mathcal{A}_{d,n}$ denote an arbitrary partition of the integer
$d$ with cardinality $n$. The elements of this partition are denoted by the
lowercase letters $a_i$. We shall omit subscripts when they are irrelevant to
the discussion. Example: two of many possible cardinality 6 partitions of
the integer 30 are $\{224679\}$ and $\{115788\}$. Table 2 shows the partitions of
$d = 5$.


Table 2: Partitions of the integer 5.

    5
    1 4
    2 3
    1 1 3
    1 2 2
    1 1 1 2
    1 1 1 1 1


Define an equipartition of the integer $d$ to be a partition in which
the largest and smallest elements differ by at most 1. An equipartition of $d$
with cardinality $n$ is denoted $\mathcal{E}_{d,n}$. By definition, $\mathcal{E}_{d,1} = \{d\}$. In Table 2 the
cardinality 3 equipartition is $\{122\}$.

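It is straightforward to verify that

$$\mathcal{E}_{d,n} = \left[\,n - d \bmod n,\ \left\lfloor \frac{d}{n} \right\rfloor\right]\left[\,d \bmod n,\ \left\lceil \frac{d}{n} \right\rceil\right]. \qquad (3)$$

For example $\mathcal{E}_{19,8} = \{22222333\} = [5, 2][3, 3]$. Since the cardinality $n$ equipartition
of an integer $d$ is unique, there are $d$ unique equipartitions of the integer
$d$.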
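
Equation (3) amounts to one integer division. A minimal Python sketch, with the hypothetical helper name `equipartition`:

```python
def equipartition(d, n):
    """Cardinality-n equipartition of d as a non-decreasing list, equation (3)."""
    q, r = divmod(d, n)                  # q = floor(d / n), r = d mod n
    return [q] * (n - r) + [q + 1] * r   # n - r copies of q, then r copies of q + 1
```

For instance `equipartition(19, 8)` gives five 2s followed by three 3s, and looping `n` from 1 to `d` produces the `d` distinct equipartitions of `d`.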

The time taken by a set of partial exchanges corresponding to a partition
of the integer $e \le d$, $\mathcal{M}_{e,n} = \{m_1, m_2, \ldots, m_n\}$, on a dimension $d$ hypercube
is

$$t_{d,\mathcal{M}_{e,n}} = \sum_{i=1}^{n} t_{d,m_i}. \qquad (4)$$

In the case $e < d$, $\mathcal{M}_{e,n}$ is not a partition of $d$ and the resultant data movement
is not a complete exchange. Nevertheless this definition is important
for subsequent analysis. When $e = d$, $\mathcal{M}_{e,n}$ is a partition of $d$, and the set of
partial exchanges corresponding to $\mathcal{M}_{e,n}$ constitutes a multiphase algorithm
for complete exchange, as described above. We shall use the terms ‘algorithm’
and ‘partition’ interchangeably, so that when we say ‘time required
by a partition’, we mean the ‘time required by a set of partial exchanges
corresponding to that partition’.

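Of particular interest to the ensuing discussion is the time required by an
equipartition, which is obtained by combining (3) and (4):

$$t_{d,\mathcal{E}_{d,n}} = (n - d \bmod n)\, t_{d,\lfloor d/n \rfloor} + (d \bmod n)\, t_{d,\lceil d/n \rceil}. \qquad (5)$$

For a partition $\mathcal{A}_{a,n} = \{a_1, a_2, \ldots, a_n\}$, we have

$$\begin{aligned}
t_{d,\mathcal{A}_{a,n}} &= \sum_{i=1}^{n} t_{d,a_i} \\
&= \left[(1 - 2^{-a_1})\tau + \rho\right] 2^d m + (2^{a_1} - 1)(\lambda + \delta) \\
&\quad + \left[(1 - 2^{-a_2})\tau + \rho\right] 2^d m + (2^{a_2} - 1)(\lambda + \delta) + \cdots \\
&\quad + \left[(1 - 2^{-a_n})\tau + \rho\right] 2^d m + (2^{a_n} - 1)(\lambda + \delta) \\
&= \left[(n - 2^{-a_1} - 2^{-a_2} - \cdots - 2^{-a_n})\tau + n\rho\right] 2^d m + (2^{a_1} + 2^{a_2} + \cdots + 2^{a_n} - n)(\lambda + \delta).
\end{aligned}$$

This prompts us to define, for the partition $\mathcal{A}_{a,n}$,

$$2^{\mathcal{A}_{a,n}} = 2^{a_1} + 2^{a_2} + \cdots + 2^{a_n}$$

and

$$2^{-\mathcal{A}_{a,n}} = 2^{-a_1} + 2^{-a_2} + \cdots + 2^{-a_n}$$

which then leads to the compact expression

$$t_{d,\mathcal{A}_{a,n}} = \left[(n - 2^{-\mathcal{A}_{a,n}})\tau + n\rho\right] 2^d m + (2^{\mathcal{A}_{a,n}} - n)(\lambda + \delta). \qquad (6)$$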




Since every element of $\mathcal{A}$ is at least 1, the coefficient of $m$ in the above
expression is $> 0$ as is the coefficient of $(\lambda + \delta)$. Thus when $t_{d,\mathcal{A}}$ is plotted
against $m$ we obtain a line with positive slope and intercept.
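
The compact form (6) says that every partition is a straight line in $m$ whose slope and intercept depend only on the cardinality and on the two sums $2^{\mathcal{A}}$ and $2^{-\mathcal{A}}$. The Python sketch below computes these with exact rational arithmetic and compares the result against the phase-by-phase sum of equation (1); all function names are hypothetical.

```python
from fractions import Fraction


def pow_sum(part):
    """The quantity written 2^A in the text: sum of 2^a over the elements."""
    return sum(2 ** a for a in part)


def neg_pow_sum(part):
    """The quantity written 2^(-A) in the text: sum of 2^(-a) over the elements."""
    return sum(Fraction(1, 2 ** a) for a in part)


def line_coefficients(part, d, tau, rho, lam, delta):
    """Slope and intercept of t_{d,A} plotted against m, from equation (6)."""
    n = len(part)
    slope = ((n - neg_pow_sum(part)) * tau + n * rho) * 2 ** d
    intercept = (pow_sum(part) - n) * (lam + delta)
    return slope, intercept


def t_by_phases(part, d, m, tau, rho, lam, delta):
    """The same time obtained by summing equation (1) over the elements."""
    return sum(
        (2 ** a - 1) * (tau * m * 2 ** (d - a) + lam + delta) + rho * m * 2 ** d
        for a in part
    )
```

For the partition {223} of 7 the two sums are 16 and 0.625, matching the corresponding row of Table 3.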

For an equipartition we have

$$t_{d,\mathcal{E}_{e,n}} = \left[(n - 2^{-\mathcal{E}_{e,n}})\tau + n\rho\right] 2^d m + (2^{\mathcal{E}_{e,n}} - n)(\lambda + \delta). \qquad (7)$$


3.3 Properties of Equipartitions 

Several properties of $2^{\mathcal{E}_{e,n}}$ and $2^{-\mathcal{E}_{e,n}}$ shall be useful in the ensuing discussion
and are presented in this Section. In understanding these properties, it is
useful to refer to Table 3 which lists $\mathcal{A}_{e,n}$, $2^{\mathcal{A}_{e,n}}$ and $2^{-\mathcal{A}_{e,n}}$ for all partitions of
$e = 7$. The last column of this table indicates if an entry is an equipartition.


Table 3: Partitions of the integer e = 7.

n    $2^{\mathcal{A}_{7,n}}$    $2^{-\mathcal{A}_{7,n}}$    $\mathcal{A}_{7,n}$    Equipartition
1    128    0.007812    7                $\mathcal{E}_{7,1}$
2     66    0.515625    1 6
2     36    0.281250    2 5
2     24    0.187500    3 4              $\mathcal{E}_{7,2}$
3     36    1.031250    1 1 5
3     22    0.812500    1 2 4
3     18    0.750000    1 3 3
3     16    0.625000    2 2 3            $\mathcal{E}_{7,3}$
4     22    1.562500    1 1 1 4
4     16    1.375000    1 1 2 3
4     14    1.250000    1 2 2 2          $\mathcal{E}_{7,4}$
5     16    2.125000    1 1 1 1 3
5     14    2.000000    1 1 1 2 2        $\mathcal{E}_{7,5}$
6     14    2.750000    1 1 1 1 1 2      $\mathcal{E}_{7,6}$
7     14    3.500000    1 1 1 1 1 1 1    $\mathcal{E}_{7,7}$


The first three properties arise from the theory of Schur-convexity [12]
which we summarize as follows.

1. Given $X, Y \in \mathbb{R}^n$, with $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$. Let $x_{[i]}, y_{[i]}$ be the $i$th largest
   component of $X, Y$, respectively.
   We say $X \prec Y$ if $\sum_{i=1}^{j} x_{[i]} \le \sum_{i=1}^{j} y_{[i]}$ for all $j = 1, 2, \ldots, n$.

2. $\Phi : \mathbb{R}^n \to \mathbb{R}$ is called Schur-convex if, whenever $X \prec Y$, then $\Phi(X) \le \Phi(Y)$.

3. If $g : \mathbb{R} \to \mathbb{R}$ is convex then $\Phi(X) = \sum_{i=1}^{n} g(x_i)$ is Schur-convex. Examples
   of such functions are $g_1(x) = 2^x$, $g_2(x) = 1/2^x$.
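
The role of Schur-convexity can be illustrated numerically: within a fixed cardinality, the equipartition is majorized by every other partition, so both convex sums are smallest for it. The Python sketch below checks this for all partitions of 7 (the contents of Table 3); the helper names are hypothetical.

```python
from fractions import Fraction


def partitions(d, smallest=1):
    """Yield every partition of d as a non-decreasing list."""
    if d == 0:
        yield []
        return
    for first in range(smallest, d + 1):
        for rest in partitions(d - first, first):
            yield [first] + rest


def equipartition(d, n):
    q, r = divmod(d, n)
    return [q] * (n - r) + [q + 1] * r


def majorized_by(x, y):
    """True if x ≺ y: equal sums, and prefix sums of x (largest first) never exceed those of y."""
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    if len(xs) != len(ys) or sum(xs) != sum(ys):
        return False
    px = py = 0
    for a, b in zip(xs, ys):
        px, py = px + a, py + b
        if px > py:
            return False
    return True


def schur_check(e):
    """Check E_{e,n} ≺ A_{e,n} and the two convex-sum inequalities for all partitions of e."""
    for p in partitions(e):
        q = equipartition(e, len(p))
        if not majorized_by(q, p):
            return False
        if sum(2 ** a for a in q) > sum(2 ** a for a in p):
            return False
        if sum(Fraction(1, 2 ** a) for a in q) > sum(Fraction(1, 2 ** a) for a in p):
            return False
    return True
```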

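Property 1 For any $1 \le n \le e$

(a) $2^{\mathcal{E}_{e,e}} \le 2^{\mathcal{A}_{e,n}}$

(b) $2^{-e} = 2^{-\mathcal{E}_{e,1}} \le 2^{-\mathcal{A}_{e,n}} \le \frac{n}{2}$.

Property 2

(a) $2^{-\mathcal{E}_{d,2}} \le 2^{-\mathcal{A}_{d,2}}$

(b) $2^{\mathcal{E}_{d,2}} \le 2^{\mathcal{A}_{d,2}}$.

Property 3 $2^{\mathcal{E}_{e,n}} \le 2^{\mathcal{E}_{e,n-1}}$.

Property 4 $2^{-\mathcal{E}_{e,n}} - 2^{-\mathcal{E}_{e,n-1}} \le 3/4$.

Proof. $\mathcal{E}_{e,n-1}$ can always be obtained by deleting the smallest element of
$\mathcal{E}_{e,n}$ and distributing it over the remaining elements of $\mathcal{E}_{e,n}$. Suppose

$$\mathcal{E}_{e,n} = \{e_1, e_2, \ldots, e_n\}$$

and that for some $k < n$,

$$e_1 = e_{1,1} + e_{1,2} + \cdots + e_{1,k},$$

all of which are greater than zero. Then

$$2^{-\mathcal{E}_{e,n}} - 2^{-\mathcal{E}_{e,n-1}} = 2^{-e_{1,1} - e_{1,2} - \cdots - e_{1,k}} + 2^{-e_2}(1 - 2^{-e_{1,1}}) + 2^{-e_3}(1 - 2^{-e_{1,2}}) + \cdots + 2^{-e_{k+1}}(1 - 2^{-e_{1,k}}),$$

where the terms $2^{-e_{k+2}}, \ldots, 2^{-e_n}$ of the unchanged elements appear in both sums and cancel.
This is a positive quantity that achieves a maximum when $k = 1$ and $e_{1,1} = 1$,
in which case it is

$$2^{-1} + 2^{-1}(1 - 2^{-1}) = 3/4.$$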

■ 

We have mentioned earlier that the time for a partition, when plotted 
against m, leads to a line with positive slope and intercept. The lines corre- 
sponding to the run times of equipartitions are of critical importance in this 
discussion. 

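Property 5 Consider the straight lines corresponding to the two equipartitions
$\mathcal{E}_{e,n}$ and $\mathcal{E}_{e,n-1}$ plotted against $m$. Then

(1) $t_{d,\mathcal{E}_{e,n}}$ has greater slope than $t_{d,\mathcal{E}_{e,n-1}}$, and

(2) $t_{d,\mathcal{E}_{e,n}}$ has smaller intercept than $t_{d,\mathcal{E}_{e,n-1}}$.

Proof. We have from (7)

$$t_{d,\mathcal{E}_{e,n}} = \left[(n - 2^{-\mathcal{E}_{e,n}})\tau + n\rho\right] 2^d m + (2^{\mathcal{E}_{e,n}} - n)(\lambda + \delta)$$

$$t_{d,\mathcal{E}_{e,n-1}} = \left[(n - 1 - 2^{-\mathcal{E}_{e,n-1}})\tau + (n - 1)\rho\right] 2^d m + (2^{\mathcal{E}_{e,n-1}} - n + 1)(\lambda + \delta)$$

so that

$$\text{slope}(t_{\mathcal{E}_{e,n}}) - \text{slope}(t_{\mathcal{E}_{e,n-1}}) = \left[(2^{-\mathcal{E}_{e,n-1}} - 2^{-\mathcal{E}_{e,n}} + 1)\tau + \rho\right] 2^d > 0 \quad \text{by Property 4}$$

$$\text{intercept}(t_{\mathcal{E}_{e,n}}) - \text{intercept}(t_{\mathcal{E}_{e,n-1}}) = (2^{\mathcal{E}_{e,n}} - 2^{\mathcal{E}_{e,n-1}} - 1)(\lambda + \delta) < 0 \quad \text{by Property 3}$$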
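
Properties 3 to 5 are easy to spot-check for small integers. The following Python sketch does so with exact fractions; it is a numerical illustration under the stated definitions, not a substitute for the proofs, and the names are hypothetical.

```python
from fractions import Fraction


def equipartition(e, n):
    q, r = divmod(e, n)
    return [q] * (n - r) + [q + 1] * r


def pow_sum(part):
    return sum(2 ** a for a in part)


def neg_pow_sum(part):
    return sum(Fraction(1, 2 ** a) for a in part)


def check_properties(e_max, tau=2, rho=1):
    """Spot-check Properties 3, 4 and 5 for every e <= e_max and 2 <= n <= e."""
    for e in range(2, e_max + 1):
        for n in range(2, e + 1):
            hi, lo = equipartition(e, n), equipartition(e, n - 1)
            if pow_sum(hi) > pow_sum(lo):                      # Property 3
                return False
            gap = neg_pow_sum(hi) - neg_pow_sum(lo)
            if not (0 < gap <= Fraction(3, 4)):                # Property 4
                return False
            slope_diff = (1 - gap) * tau + rho                 # per unit of 2^d
            icept_diff = pow_sum(hi) - pow_sum(lo) - 1         # per unit of (lam + delta)
            if not (slope_diff > 0 and icept_diff < 0):        # Property 5
                return False
    return True
```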


The times taken by equipartitions thus form a hull in which the leftmost
face corresponds to a partition with maximum cardinality, while the rightmost
face corresponds to a partition of minimum cardinality. Faces of decreasing
cardinality lie between these extreme faces. Figure 1(a) shows plots
of the run times of all partitions (not necessarily equipartitions) of $d = 4$
on a hypercube of dimension 4. We can see that the hull of optimality is
formed by equipartitions {1111}, {22} and {4}. The non-equipartition {13}
does not touch the hull. The equipartition {112} touches the hull but does
not contribute a face (it passes through the point of intersection of {1111}
and {22}). Figure 1(b) shows the times for all partitions of $d = 6$ on a hypercube
of dimension 6. In this case the hull is formed by the equipartitions
{111111}, {222}, {33} and {6}. Only a few of the remaining partitions are
labeled to avoid a congested plot, but we can see that out of the 11 partitions
of the integer 6, only the abovementioned 4 equipartitions contribute a face
to the hull.

[Figure 1: two plots of Time (millisec.) against Message Size (bytes); panel (a) d = 4, panel (b) d = 6.]

Figure 1: Run times for $d = 4, 6$. In this particular example, $\lambda = 100$, $\delta = 10$
($\mu$sec.) and $\tau = 2$, $\rho = 1$ ($\mu$sec./byte). Circle indicates the point of
intersection of all partitions of cardinality 2: {33}, {24} and {15}.
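
The faces reported for Figure 1 can be reproduced by brute force from equation (6). The Python sketch below is an illustration, not the method developed in this report: it computes the lower envelope of all partition lines by probing between every pair of consecutive intersection points, using exact rational arithmetic so that lines that only touch the hull at a vertex are not counted as faces. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def partitions(d, smallest=1):
    if d == 0:
        yield []
        return
    for first in range(smallest, d + 1):
        for rest in partitions(d - first, first):
            yield [first] + rest


def line(part, d, tau, rho, lam, delta):
    """Slope and intercept from equation (6)."""
    n = len(part)
    neg = sum(Fraction(1, 2 ** a) for a in part)
    pos = sum(2 ** a for a in part)
    return ((n - neg) * tau + n * rho) * 2 ** d, (pos - n) * (lam + delta)


def hull_faces(d, tau, rho, lam, delta):
    """Partitions that are strictly optimal on some interval of m, left to right."""
    lines = {tuple(p): line(p, d, tau, rho, lam, delta) for p in partitions(d)}
    cuts = {Fraction(0)}
    for (s1, c1), (s2, c2) in combinations(lines.values(), 2):
        if s1 != s2:
            x = Fraction(c2 - c1) / (s1 - s2)   # where the two lines cross
            if x > 0:
                cuts.add(x)
    cuts = sorted(cuts)
    # One probe strictly inside each interval, plus one beyond the last crossing.
    probes = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])] + [cuts[-1] + 1]
    faces = []
    for m in probes:
        best = min(lines, key=lambda p: lines[p][0] * m + lines[p][1])
        if not faces or faces[-1] != best:
            faces.append(best)
    return faces
```

With the parameter values of Figure 1, `hull_faces(4, 2, 1, 100, 10)` yields {1111}, {22}, {4} and `hull_faces(6, 2, 1, 100, 10)` yields {111111}, {222}, {33}, {6}.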

We now prove Properties 6 and 7 which are also illustrated in Figure 1. 

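Property 6 For any $d$,

(a) $\mathcal{E}_{d,1}$ always lies on the hull, and

(b) $\mathcal{E}_{d,d}$ always lies on the hull.

Proof. $\mathcal{A}_{d,n}$ represents an arbitrary partition of cardinality $n$. From (6) and
(7) we have

$$t_{d,\mathcal{A}_{d,n}} = \left[(n - 2^{-\mathcal{A}_{d,n}})\tau + n\rho\right] 2^d m + (2^{\mathcal{A}_{d,n}} - n)(\lambda + \delta)$$

$$t_{d,\mathcal{E}_{d,1}} = \left[(1 - 2^{-\mathcal{E}_{d,1}})\tau + \rho\right] 2^d m + (2^{\mathcal{E}_{d,1}} - 1)(\lambda + \delta)$$

$$t_{d,\mathcal{E}_{d,d}} = \left[(d - 2^{-\mathcal{E}_{d,d}})\tau + d\rho\right] 2^d m + (2^{\mathcal{E}_{d,d}} - d)(\lambda + \delta)$$

(a) The expression

$$\begin{aligned}
t_{d,\mathcal{A}_{d,n}} - t_{d,\mathcal{E}_{d,1}} &= \left[(n - 1 - 2^{-\mathcal{A}_{d,n}} + 2^{-\mathcal{E}_{d,1}})\tau + (n - 1)\rho\right] 2^d m + (2^{\mathcal{A}_{d,n}} - 2^{\mathcal{E}_{d,1}} - n + 1)(\lambda + \delta) \\
&\ge \left[\left(\frac{n}{2} + 2^{-d} - 1\right)\tau + (n - 1)\rho\right] 2^d m + (2^{\mathcal{A}_{d,n}} - 2^{\mathcal{E}_{d,1}} - n + 1)(\lambda + \delta) \quad \text{(by Property 1(b))}
\end{aligned}$$

is always positive for sufficiently large $m$ and $n > 1$ (for $n = 1$, $\mathcal{A}_{d,n} = \mathcal{E}_{d,1}$
and the property holds vacuously). Thus $\mathcal{E}_{d,1}$ lies below any $\mathcal{A}_{d,n}$ for
sufficiently large $m$.

(b) At $m = 0$, the expression

$$t_{d,\mathcal{A}_{d,n}} - t_{d,\mathcal{E}_{d,d}} = (2^{\mathcal{A}_{d,n}} - 2^{\mathcal{E}_{d,d}} - n + d)(\lambda + \delta)$$

is greater than zero, since $2^{\mathcal{A}_{d,n}} \ge 2^{\mathcal{E}_{d,d}}$ (by Property 1(a)) and $d > n$. Thus
$\mathcal{E}_{d,d}$ lies below $\mathcal{A}_{d,n}$ for $m = 0$. ■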

The partition $\mathcal{E}_{d,1}$ corresponds to the Direct algorithm, while $\mathcal{E}_{d,d}$ is equivalent
to the Standard exchange. These two algorithms are extreme cases of
the Multiphase algorithm. Property 6 tells us that the Direct algorithm is
always optimal for large values of $m$, while Standard exchange is always best
for very small values.

Property 7 Of all partitions of cardinality 2, only $\mathcal{E}_{d,2}$ can lie on the hull.

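Proof. Consider the two partitions $\mathcal{E}_{d,2} = \{e, d - e\}$ and $\mathcal{A}_{d,2} = \{a, d - a\}$.
We have

$$t_{d,\mathcal{E}_{d,2}} = \left[(2 - 2^{-\mathcal{E}_{d,2}})\tau + 2\rho\right] 2^d m + (2^{\mathcal{E}_{d,2}} - 2)(\lambda + \delta)$$

$$t_{d,\mathcal{A}_{d,2}} = \left[(2 - 2^{-\mathcal{A}_{d,2}})\tau + 2\rho\right] 2^d m + (2^{\mathcal{A}_{d,2}} - 2)(\lambda + \delta)$$

Solving for $t_{d,\mathcal{E}_{d,2}} = t_{d,\mathcal{A}_{d,2}}$ we obtain

$$\begin{aligned}
m &= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}{(2^{-\mathcal{A}_{d,2}} - 2^{-\mathcal{E}_{d,2}})\,\tau\, 2^d} \\
&= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}{(2^{-a} + 2^{-d+a} - 2^{-e} - 2^{-d+e})\,\tau\, 2^d} \\
&= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}{(2^{d-a} + 2^{a} - 2^{d-e} - 2^{e})\,\tau} \\
&= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})\,\tau} \\
&= \frac{\lambda + \delta}{\tau}
\end{aligned}$$

which is independent of $\mathcal{E}_{d,2}$ and $\mathcal{A}_{d,2}$. Thus all partitions of cardinality 2
intersect at a point.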
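
The common intersection point can be confirmed numerically. The Python sketch below evaluates equation (6) for every cardinality 2 partition of $d$ at $m = (\lambda + \delta)/\tau$, using exact fractions; the function names are hypothetical and the parameter values in the usage note are those quoted for Figure 1.

```python
from fractions import Fraction


def t_partition(part, d, m, tau, rho, lam, delta):
    """Run time of a partition from equation (6)."""
    n = len(part)
    neg = sum(Fraction(1, 2 ** a) for a in part)
    pos = sum(2 ** a for a in part)
    return ((n - neg) * tau + n * rho) * 2 ** d * m + (pos - n) * (lam + delta)


def cardinality_two_times(d, m, tau, rho, lam, delta):
    """Times of {1, d-1}, {2, d-2}, ..., ending with the most balanced pair."""
    return [
        t_partition((a, d - a), d, m, tau, rho, lam, delta)
        for a in range(1, d // 2 + 1)
    ]
```

For $d = 6$, $\tau = 2$, $\rho = 1$, $\lambda = 100$, $\delta = 10$ the crossing is at $m = 55$ bytes, where {15}, {24} and {33} all take the same time; the Direct partition {6} is already faster there.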

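Since $2^{-\mathcal{E}_{d,2}} < 2^{-\mathcal{A}_{d,2}}$ and $2^{\mathcal{E}_{d,2}} < 2^{\mathcal{A}_{d,2}}$ (by Property 2), $t_{\mathcal{E}_{d,2}}$ has greater
slope and lesser intercept than $t_{\mathcal{A}_{d,2}}$. Therefore only $t_{\mathcal{E}_{d,2}}$ can lie on the hull
for $m < (\lambda + \delta)/\tau$.

At $m = (\lambda + \delta)/\tau$ we have

$$\begin{aligned}
t_{d,\mathcal{E}_{d,2}} - t_{d,\mathcal{E}_{d,1}} &= \left[(1 - 2^{-\mathcal{E}_{d,2}} + 2^{-\mathcal{E}_{d,1}})\tau + \rho\right] 2^d\, \frac{\lambda + \delta}{\tau} + (2^{\mathcal{E}_{d,2}} - 2^{\mathcal{E}_{d,1}} - 1)(\lambda + \delta) \\
&= \left[(1 - 2^{-e} - 2^{-d+e} + 2^{-d})\,2^d + (2^{e} + 2^{d-e} - 2^{d} - 1) + \rho 2^d/\tau\right](\lambda + \delta) \\
&= \frac{\rho 2^d (\lambda + \delta)}{\tau}
\end{aligned}$$

which is always positive. This means that the line $t_{\mathcal{E}_{d,1}}$ always passes below
the common point of intersection of all cardinality 2 partitions. We have
already shown that of all cardinality 2 partitions, only $\mathcal{E}_{d,2}$ can lie on the hull
below this point. Hence of all cardinality 2 partitions only $\mathcal{E}_{d,2}$ can lie on the
hull. ■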

In the following, we shall prove that a non-equipartition cannot contribute 
a face to the hull of optimality and further that a large number of equipar- 
titions can at most touch the hull at a vertex. Therefore, although there is 
an exponential number of partitions of an integer $d$, we shall prove that the
number of faces on the hull of optimality is $\Theta(\sqrt{d})$.

3.4 Main Theorems


The properties proved above permit us to determine the maximum number
of faces on the hull of optimality. Table 4 lists all partitions of the integers
$1, \ldots, 7$.



Turning our attention to the partitions of 7, we see that if we select all
those partitions that have a ‘1’ in them (these are boxed in the table) and
then delete a ‘1’ from each of these, we obtain the partitions of 6, which
are given in the next column. Similarly, selecting all partitions of 7 that
have a ‘2’ in them and then deleting a ‘2’ from each of these will result in
the partitions of the integer 5 and so on. It is thus clear that the set of all
partitions of the integer $d$ is composed of the union of the sets of all partitions
of the integers $d - a$, $1 \le a < d$, each augmented by $a$, and the integer $d$. For
a specific partition we have

$$\mathcal{A}_{k,m} = \mathcal{A}_{k-a,m-1} + \{a\}$$

where we take the ‘+’ operation to mean the addition of an element to a
partition. The following property is evident.

Property 8 $t_{d,\mathcal{A}_{k,m}} = t_{d,\mathcal{A}_{k-a,m-1}} + t_{d,\{a\}}$.

It follows that the straight lines describing the run times of all partitions of
an integer $d$ can be obtained by adding $t_{d,\{a\}}$ to the run times of all partitions
of the integer $d - a$, and then adding the line $t_{d,\mathcal{E}_{d,1}}$. This permits us to prove
the following Theorem.

Theorem 1 A non-equipartition cannot touch the hull of optimality. 

Proof. By induction on the partitions of integers $\le d$.

Basis step: The smallest integer that has a non-equipartition is 4, which
has only one: {13}. As the basis step of our induction, we shall prove that
$t_{d,\{13\}}$ can never touch the hull.

The equations for the 5 partitions of 4 are, from (1),

$$\begin{aligned}
t_{d,\{4\}} &= 15\lambda + 15\delta + 2^d m \left(\rho + \frac{15\tau}{16}\right) \\
t_{d,\{13\}} &= 8\lambda + 8\delta + 2^d m \left(2\rho + \frac{11\tau}{8}\right) \\
t_{d,\{22\}} &= 6\lambda + 6\delta + 2^d m \left(2\rho + \frac{3\tau}{2}\right) \\
t_{d,\{112\}} &= 5\lambda + 5\delta + 2^d m \left(3\rho + \frac{7\tau}{4}\right) \\
t_{d,\{1111\}} &= 4\lambda + 4\delta + 2^d m \left(4\rho + 2\tau\right)
\end{aligned}$$

The point of intersection of $t_{d,\{4\}}$ and $t_{d,\{22\}}$ is

$$m = \frac{144(\lambda + \delta)}{2^d (16\rho + 9\tau)}.$$

At this value of $m$ we have

$$t_{d,\{4\}} - t_{d,\{13\}} = \frac{-32\rho(\lambda + \delta)}{16\rho + 9\tau}$$

which is always negative.

The lines $t_{d,\{22\}}$, $t_{d,\{112\}}$ and $t_{d,\{1111\}}$ intersect at a single point which
occurs at

$$m = \frac{4(\lambda + \delta)}{2^d (4\rho + \tau)}.$$

At this value of $m$ we have

$$t_{d,\{1111\}} - t_{d,\{13\}} = \frac{-(\lambda + \delta)(16\rho + 3\tau)}{2(4\rho + \tau)}$$

which is also always negative. Thus the partition {13} can never touch the
hull of optimality.

Induction: Suppose the theorem is true for all partitions of the integer
$k < d$. Partitions of the integer $k + 1$ can be obtained by adding $1, 2, \ldots, k$
to the partitions of the integers $k, k - 1, \ldots, 1$, and then adding $\mathcal{E}_{k+1,1} = \{k + 1\}$
as discussed above. The corresponding run times are obtained by adding
$t_{d,1}, t_{d,2}, \ldots, t_{d,k}$ to run times of all the constituent partitions, as stated in
Property 8. Each time we add $t_{d,i}$ to all the partitions of a certain integer
we raise the hull of optimality and all other lines by a linear amount. The
resultant hull of optimality of cardinality $k + 1$ will be the intersection of the
hulls of cardinality $1, 2, \ldots, k$. A line that did not touch one of the constituent
hulls cannot touch the intersected hull.

When a partition is augmented, a new non-equipartition of cardinality k 
can be created by augmenting (1) a non-equipartition of cardinality j or (2) 
an equipartition of cardinality j. In the first case our hypothesis continues 
to hold since a non-equipartition not touching the hull is transformed into 
an non-equipartition that still does not touch the hull. 

The second case requires careful analysis. When an existing equipartition
$\mathcal{E}_{d,j}$ of cardinality $j$, that by hypothesis must touch the hull,
is transformed into a non-equipartition $\mathcal{E}_{d,j} + \{k - j\}$ we have
two possibilities:

k > 3 Consider the partition obtained by deleting one of the original
elements, $m \in \mathcal{E}_{d,j}$, from $\mathcal{E}_{d,j} + \{k - j\}$. This
new partition must be a non-equipartition of cardinality $d - m$. In the
hull for $d - m$ it could not have touched the hull, being 'masked' by
equipartitions of cardinality $d - m$, and therefore it can now also not
touch the hull after augmentation.

k = 2 In this case Property 7 states that only the equipartition of cardinality
2 can lie on the hull.

Therefore no non-equipartition can touch the new augmented hull of
optimality for $k + 1$. We have proved that if the theorem is true for $k$ it is also
true for $k + 1$. We have already shown that it is true for $k = 2, 3, 4$. Thus it
is true for all $k$. ■

An important consequence of Theorem 1 is the fact that even though
there is an exponential number of partitions of $d$, the total number of faces
on the hull of optimality cannot exceed $d$, the number of equipartitions. We
shall continue with further investigations into the properties of equipartitions.
These will permit us to improve the bound on the number of faces to $O(\sqrt{d})$.
At this point we prove a theorem that shall permit us to place a lower bound 
on the number of faces on the hull. 

Theorem 2 Every equipartition must touch the hull of optimality. 

Proof. By induction on d. 

Basis step: The theorem is true for d = 2, since by Property 6, both {11} 
and {2} must lie on the hull. 

Induction: Assume the theorem is true for $d = n$. Then the hull of optimality
is touched by all equipartitions $\mathcal{E}_{n,i}$, $1 \le i \le n$. The set of
equipartitions of the integer $n + 1$, that is $\mathcal{E}_{n+1,i}$, can be formed
from the set of equipartitions of the integer $n$ by adding 1 to the smallest
element of each $\mathcal{E}_{n,i}$ and then adding the new equipartition
$\mathcal{E}_{n+1,n+1}$.

Turning to the corresponding run times, this operation is equivalent to
adding

$$t_{d,\{1\}} = \left(\tfrac{1}{2}\tau + \rho\right) 2^d m + (\lambda + \sigma)$$

to each of $t_{\mathcal{E}_{n,i}}$, $1 \le i \le n$ (see equation (1)). Since the
same linear expression is added to each $t_{\mathcal{E}_{n,i}}$, the relationships
between these lines are undisturbed and the augmented hull is touched by all
of the augmented equipartitions. Now consider $t_{\mathcal{E}_{n+1,n+1}}$; this
must touch the hull because of Property 6(b). Thus the hull of optimality of
the integer $n + 1$ is touched by all equipartitions of $n + 1$.

We have proved the theorem to be true for $d = 2$. We have shown that if
it is true for $d = n$ it is also true for $d = n + 1$. It is therefore true for all $d$. ■

An equipartition $\mathcal{E}_{d,n}$ can only have two distinct elements:
$\lfloor d/n \rfloor$ and $\lceil d/n \rceil$. In some cases sequences of several
different equipartitions have the same two distinct elements. For example, in
Table 4, $\mathcal{E}_{6,6} = [6,1]$, $\mathcal{E}_{6,5} = [4,1][1,2]$,
$\mathcal{E}_{6,4} = [2,1][2,2]$ and $\mathcal{E}_{6,3} = [3,2]$. All these
partitions are composed of 1's and 2's exclusively. Similarly, the following
equipartitions of the integer 19,

$\mathcal{E}_{19,7} = \{2233333\}$

$\mathcal{E}_{19,8} = \{22222333\}$

$\mathcal{E}_{19,9} = \{222222223\}$

are all composed of 2's and 3's exclusively. We call such equipartitions
indistinct. It is clear that indistinct partitions always have successive
cardinality values.
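
The examples above can be reproduced with a short sketch. The helper names
`equipartition` and `indistinct_run` are hypothetical and introduced only for
illustration; the second one lists the successive cardinalities whose
equipartitions of d are built exclusively from a given pair of elements v and
v + 1.

```python
def equipartition(d, n):
    """Equipartition of d with n elements, as a sorted list.

    The elements are floor(d/n) and ceil(d/n); exactly d mod n of them
    take the larger value.
    """
    q, r = divmod(d, n)
    return [q] * (n - r) + [q + 1] * r

def indistinct_run(d, v):
    """Cardinalities whose equipartitions of d use only the elements v, v+1."""
    first = -(-d // (v + 1))  # ceil(d / (v + 1)) using integer arithmetic
    last = d // v             # floor(d / v)
    return list(range(first, last + 1))

# Equipartitions of 19 composed of 2's and 3's exclusively.
for n in indistinct_run(19, 2):
    print(n, equipartition(19, n))

# Equipartitions of 6 composed of 1's and 2's exclusively.
for n in indistinct_run(6, 1):
    print(n, equipartition(6, n))
```

For d = 19 and v = 2 this yields the cardinalities 7, 8 and 9, which are
successive as noted in the text.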

Theorem 3 The run time functions of indistinct equipartitions are linearly 
dependent. 

Proof. Consider three indistinct equipartitions of cardinality $p$, $p + 1$ and
$p - 1$ that are composed of the elements $\Omega$ and $\Omega + 1$. Then for
some $\alpha$, $\beta$, $\gamma$

$$\mathcal{E}_{d,p} = [\alpha, \Omega]\,[p - \alpha, \Omega + 1]$$

$$\mathcal{E}_{d,p+1} = [\beta, \Omega]\,[p + 1 - \beta, \Omega + 1]$$

$$\mathcal{E}_{d,p-1} = [\gamma, \Omega]\,[p - 1 - \gamma, \Omega + 1]$$

The times for these equipartitions are

$$t_{\mathcal{E}_{d,p}} = \alpha\, t_{d,\Omega} + (p - \alpha)\, t_{d,\Omega+1} \tag{8}$$

$$t_{\mathcal{E}_{d,p+1}} = \beta\, t_{d,\Omega} + (p + 1 - \beta)\, t_{d,\Omega+1} \tag{9}$$

$$t_{\mathcal{E}_{d,p-1}} = \gamma\, t_{d,\Omega} + (p - 1 - \gamma)\, t_{d,\Omega+1} \tag{10}$$

Since we are dealing with equipartitions of the integer $d$,

$$d = \alpha\Omega + (p - \alpha)(\Omega + 1)
    = \beta\Omega + (p + 1 - \beta)(\Omega + 1)
    = \gamma\Omega + (p - 1 - \gamma)(\Omega + 1)$$

These yield the following relations

$$\beta = 1 + \alpha + \Omega \tag{11}$$

$$\gamma = -1 + \alpha - \Omega \tag{12}$$


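Substituting (11) and (12) in (9) and (10) we obtain the system

$$t_{\mathcal{E}_{d,p}} = \alpha\, t_{d,\Omega} + (p - \alpha)\, t_{d,\Omega+1} \tag{13}$$

$$t_{\mathcal{E}_{d,p+1}} = (1 + \alpha + \Omega)\, t_{d,\Omega} + (p - \alpha - \Omega)\, t_{d,\Omega+1} \tag{14}$$

$$t_{\mathcal{E}_{d,p-1}} = (-1 + \alpha - \Omega)\, t_{d,\Omega} + (p - \alpha + \Omega)\, t_{d,\Omega+1} \tag{15}$$

Adding (14) and (15) we obtain

$$t_{\mathcal{E}_{d,p+1}} + t_{\mathcal{E}_{d,p-1}} = 2\alpha\, t_{d,\Omega} + 2(p - \alpha)\, t_{d,\Omega+1} = 2\, t_{\mathcal{E}_{d,p}}$$

Hence the system is linearly dependent. ■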
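
As a concrete check of the argument, consider the three indistinct
equipartitions of 19 with cardinalities 7, 8 and 9. This worked instance is
added for illustration; it uses only the compact [count, element] notation
and the relations derived in the proof, with the smaller element equal to 2
and the middle partition containing five 2's.

```latex
% Middle partition: p = 8, five 2's and three 3's.
\mathcal{E}_{19,8} = [5,2]\,[3,3], \qquad p = 8,\ \alpha = 5

% Counts of the element 2 in the neighbours, from the two relations:
\beta  = 1 + 5 + 2 = 8  \;\Rightarrow\; \mathcal{E}_{19,9} = [8,2]\,[1,3]
\gamma = -1 + 5 - 2 = 2 \;\Rightarrow\; \mathcal{E}_{19,7} = [2,2]\,[5,3]

% Sum of the two outer run times:
t_{\mathcal{E}_{19,9}} + t_{\mathcal{E}_{19,7}}
  = (8\,t_{19,2} + t_{19,3}) + (2\,t_{19,2} + 5\,t_{19,3})
  = 10\,t_{19,2} + 6\,t_{19,3}
  = 2\,(5\,t_{19,2} + 3\,t_{19,3})
  = 2\,t_{\mathcal{E}_{19,8}}
```

The identity holds whatever the values of the two partial-exchange times, so
it does not depend on the performance parameters.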

Theorem 3 assures us that all members of a set of indistinct equipartitions 
intersect at a single point. Therefore only two of these can contribute faces 
to the hull of optimality, since they have successively decreasing slopes and 
increasing intercepts (Property 5). For example, in the hull for d = 4 (Figure 
1), we can see that the equipartitions {1111}, {112} and {22} intersect at a 
point and only {1111} and {22} contribute faces to the hull. 

3.5 Faces on the Hull 

From the foregoing discussion we can see that all equipartitions touch the 
hull. Each distinct equipartition contributes a single face to the hull while 
each set of indistinct equipartitions contributes two faces. To find a bound 
on the number of faces on the hull, refer to Figure 2 which plots
$\lfloor d/n \rfloor$ and $\lceil d/n \rceil$ versus $n$ for $d = 11$ and 16. In
each of these plots, the dashed curve represents the continuous function $d/n$.
The values of $\lfloor d/n \rfloor$ and $\lceil d/n \rceil$ are indicated by heavy
dots. When $\lfloor d/n \rfloor < \lceil d/n \rceil$, there is a vertical
line joining these dots. In the plot for $d = 11$ we have enumerated all the
equipartitions in full, while in the plot for $d = 16$ we have used the compact
notation (3). The lines marked with '+'s are tangents, with slope $-1$, at the
point $(\lfloor\sqrt{d}\rfloor, \lfloor\sqrt{d}\rfloor)$.

Over the range $1 \le n \le \sqrt{d}$ the slope of these hyperbolas is less than
$-1$ and therefore no two consecutive equipartitions can have an element in
common. All equipartitions in this range are distinct and their number is
equal to the number of integers in this range, which is $\lfloor\sqrt{d}\rfloor$.
This equals the number of '+'s on the tangent between $n = 1$ and
$n = \lfloor\sqrt{d}\rfloor$.

Figure 2: Plots of $\lfloor d/n \rfloor$, $\lceil d/n \rceil$ for $d = 11$ and 16.


Indistinct equipartitions can only occur over the range $\sqrt{d} < n \le d$.
The number of sets of indistinct equipartitions is no more than the number of
distinct values of $\lfloor d/n \rfloor$, which is the number of '+'s on the
tangent between $\sqrt{d}$ and $d$, and is again $\lfloor\sqrt{d}\rfloor$.

In the range $1 \le n \le \sqrt{d}$, there are no indistinct equipartitions, so one
face is contributed to the hull by each equipartition, giving us a total of
$\lfloor\sqrt{d}\rfloor$ faces. In the range $\sqrt{d} < n \le d$ there may be up to
$\lfloor\sqrt{d}\rfloor$ sets of indistinct equipartitions, each contributing at
most 2 faces and at least one face to the hull. An upper bound on the total
number of faces on the hull is therefore $3\lfloor\sqrt{d}\rfloor$. To obtain a
lower bound note that the hyperbola is symmetric about the line $n = d/n$ (the
line through the origin with slope 1). If the point
$(\lfloor\sqrt{d}\rfloor, \lfloor\sqrt{d}\rfloor)$ lies on this line the number of
distinct levels is $2\lfloor\sqrt{d}\rfloor - 1$ and is $2\lfloor\sqrt{d}\rfloor$
otherwise. Each level must contribute at least one face to the hull.
Thus the lower bound on the number of faces is $2\lfloor\sqrt{d}\rfloor - 1$.
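
The count of levels is easy to tabulate. The sketch below takes a "level" to
mean a distinct value of floor(d/n) for n = 1..d, which is an interpretation
of the text rather than a definition given in it, and prints that count next
to the lower and upper bounds on the number of faces stated above. The
function names are hypothetical.

```python
from math import isqrt

def distinct_levels(d):
    """Number of distinct values taken by floor(d/n) for n = 1..d."""
    return len({d // n for n in range(1, d + 1)})

def face_bounds(d):
    """(lower, upper) bounds on the number of hull faces quoted in the text."""
    s = isqrt(d)  # floor(sqrt(d))
    return 2 * s - 1, 3 * s

for d in (11, 16):
    print(d, distinct_levels(d), face_bounds(d))
```

For both values of d plotted in Figure 2 the number of levels equals the
lower bound, and for every d it is either 2 floor(sqrt(d)) - 1 or
2 floor(sqrt(d)).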

The equipartitions that contribute to the hull can be found by visiting
all $\Theta(\sqrt{d})$ points on the tangent. For $1 \le n \le \lfloor\sqrt{d}\rfloor$,
each point corresponds to the equipartition $\mathcal{E}_{d,n}$. For each $n$ in
the range $\lfloor\sqrt{d}\rfloor, \ldots, 2\lfloor\sqrt{d}\rfloor$ there is a
sequence of indistinct equipartitions extending from $\lceil d/(n + 1) \rceil$
to $\lfloor d/n \rfloor$. We need consider only the first and last members of
these sequences. Thus all partitions contributing to the hull can be found in
$O(\sqrt{d})$ time. Once these partitions have been found, the vertices of the
hull can be discovered by computing the intersection points of adjacent
partitions, again in $\Theta(\sqrt{d})$ time. The intersection points will be
computed in order and, once they have been stored, the optimal algorithm for
any value of $m$ can be found using a binary search in $O(\log d)$ time.
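
The procedure just described can be sketched as follows. Two simplifications
are made and should be read as assumptions: the run time lines use the cost
form t_{d,k} = (2^k - 1)(tau 2^(d-k) m + lambda + sigma) + rho 2^d m, which is
inferred rather than quoted from this section, and for brevity all d
equipartitions are enumerated instead of only the O(sqrt(d)) candidates. All
function names are hypothetical.

```python
from bisect import bisect_right
from fractions import Fraction

def equipartition(d, n):
    """Equipartition of d with n elements: values floor(d/n) and ceil(d/n)."""
    q, r = divmod(d, n)
    return [q] * (n - r) + [q + 1] * r

def run_time_line(d, parts, lam, sigma, tau, rho):
    """(slope, intercept) of run time versus m under the ASSUMED cost model."""
    slope = sum((2**k - 1) * tau * 2**(d - k) + rho * 2**d for k in parts)
    intercept = sum((2**k - 1) * (lam + sigma) for k in parts)
    return slope, intercept

def crossing(a, b):
    """Value of m at which lines a and b meet (their slopes must differ)."""
    return Fraction(b[1] - a[1], a[0] - b[0])

def lower_hull(lines):
    """Lower envelope of (slope, intercept, label) lines, left to right in m."""
    hull = []
    for s, c, label in sorted(lines, key=lambda L: (-L[0], L[1])):
        if hull and hull[-1][0] == s:
            continue  # same slope, intercept not smaller: never better
        # Drop the last line if the new one overtakes its predecessor first.
        while len(hull) >= 2 and \
                crossing(hull[-2], (s, c)) <= crossing(hull[-2], hull[-1]):
            hull.pop()
        hull.append((s, c, label))
    breaks = [crossing(a, b) for a, b in zip(hull, hull[1:])]
    return hull, breaks

def build_hull(d, lam, sigma, tau, rho):
    """Hull of optimality over all equipartitions of d."""
    lines = []
    for n in range(1, d + 1):
        parts = equipartition(d, n)
        lines.append(run_time_line(d, parts, lam, sigma, tau, rho) + (parts,))
    return lower_hull(lines)

def best_partition(hull, breaks, m):
    """Binary search over the stored vertices for message size m."""
    return hull[bisect_right(breaks, m)][2]

# Arbitrary illustrative parameters (not measurements), d = 4.
hull, breaks = build_hull(4, 100, 20, 3, Fraction(1, 2))
print([h[2] for h in hull], breaks)
print(best_partition(hull, breaks, 1), best_partition(hull, breaks, 100))
```

For d = 4 the sketch keeps {1111}, {22} and {4} and discards {112}, which
only touches the hull at a vertex. The breakpoints are sorted by
construction, so each query costs one `bisect_right` call.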

4 Conclusions 

We have analyzed the multiphase complete exchange algorithm and shown
that the total number of optimal algorithms lies between $2\sqrt{d} - 1$ and $3\sqrt{d}$.
This holds under the assumption that the time for transmitting a message is 
independent of the number of communication links traversed. High perfor- 
mance parallel machines satisfy this assumption. 

In addition to its theoretical interest, this result is of considerable prac- 
tical importance. It allows us to compute the optimal algorithm for any 
given values of hypercube performance parameters and message length very 
quickly. When dealing with an application where the performance parame- 
ters (Table 1) are fixed and the message lengths for complete exchange vary 
from time to time, the values of message length m at which vertices of the 
hull of optimality occur can be computed ahead of time and stored in a sorted 
list. During the course of program execution, a fast binary search will locate 
the optimal algorithm for the current message size. 

When the performance parameters vary with time, as would happen if 
the communication network were shared among several subcubes, our results 
provide a fast method for computing the optimal algorithm from scratch. 
A related situation is where the same application is run on hypercubes of 
different sizes. 


Among the future directions of this research, the foremost issue is an
extension to 2 and 3 dimensional meshes. Preliminary results on 2-dimensional
meshes appear in [5]. Since the time required for 'direct' complete exchanges
on $N$-processor 2 and 3-d meshes is $O(N^{3/2})$ and $O(N^{4/3})$ respectively [15],
compared to the hypercube's $O(N)$, any improvements will be especially
welcome.